About Dataset¶

The dataset comprises information on 1000 customers, with 84 features derived from their financial transactions and current financial standing. The primary objective is to use this dataset to estimate credit risk and predict potential defaults.

  • CUST_ID: Unique customer identifier

Key Target Variables:

  • CREDIT_SCORE: Numerical target variable representing the customer's credit score (integer)

  • DEFAULT: Binary target variable indicating if the customer has defaulted (1) or not (0)

Description of Features:

  • INCOME: Total income in the last 12 months

  • SAVINGS: Total savings in the last 12 months
  • DEBT: Total existing debt
  • R_SAVINGS_INCOME: Ratio of savings to income
  • R_DEBT_INCOME: Ratio of debt to income
  • R_DEBT_SAVINGS: Ratio of debt to savings

Transactions are categorized into groups (GROCERIES, CLOTHING, HOUSING, EDUCATION, HEALTH, TRAVEL, ENTERTAINMENT, GAMBLING, UTILITIES, TAX, FINES). For each group, the dataset provides:¶

  • T_{GROUP}_6: Total expenditure in that group in the last 6 months
  • T_{GROUP}_12: Total expenditure in that group in the last 12 months
  • R_{GROUP}: Ratio of T_{GROUP}_6 to T_{GROUP}_12
  • R_{GROUP}_INCOME: Ratio of T_{GROUP}_12 to INCOME
  • R_{GROUP}_SAVINGS: Ratio of T_{GROUP}_12 to SAVINGS
  • R_{GROUP}_DEBT: Ratio of T_{GROUP}_12 to DEBT

Categorical Features:

  • CAT_GAMBLING: Gambling category (none, low, high)

  • CAT_DEBT: 1 if the customer has debt; 0 otherwise
  • CAT_CREDIT_CARD: 1 if the customer has a credit card; 0 otherwise
  • CAT_MORTGAGE: 1 if the customer has a mortgage; 0 otherwise
  • CAT_SAVINGS_ACCOUNT: 1 if the customer has a savings account; 0 otherwise
  • CAT_DEPENDENTS: 1 if the customer has any dependents; 0 otherwise

Our first steps¶

  • Split the whole dataset into two parts: one for modellers and one for validators (70% and 30%)
  • Split the modellers' part into two parts: for training and for validation (70% and 30%)
  • Split that validation part into two parts: for validation and for testing (66% and 34%)
  • Save all parts into separate files
  • Remember to set the random seed for reproducibility
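As a quick sanity check, the nested splits imply the following sizes for 1000 rows. A minimal sketch, assuming train_test_split rounds the test share up (which scikit-learn does):

import math

n = 1000
n_val = math.ceil(n * 0.3)          # 300 rows for validators, 700 for modellers
n_mod = n - n_val
n_valid = math.ceil(n_mod * 0.3)    # 210 rows for validation+testing, 490 for training
n_train = n_mod - n_valid
n_test = math.ceil(n_valid * 0.34)  # 72 rows for testing, 138 for validation
n_validation = n_valid - n_test
print(n_mod, n_train, n_validation, n_test, n_val)  # 700 490 138 72 300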
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')
np.random.seed(42)  # seed the global NumPy RNG; "np.random.seed = 42" would silently overwrite the function instead

df = pd.read_csv('../../data/credit_score.csv')
# keep everything except the first column (CUST_ID) and the last one (DEFAULT)
X = df.iloc[:, 1:-1]
y = df.iloc[:, -1]
In [2]:
df[["CUST_ID"]] 
Out[2]:
CUST_ID
0 C02COQEVYU
1 C02OZKC0ZF
2 C03FHP2D0A
3 C03PVPPHOY
4 C04J69MUX0
... ...
995 CZQHJC9HDH
996 CZRA4MLB0P
997 CZSOD1KVFX
998 CZWC76UAUT
999 CZZV5B3SAL

1000 rows × 1 columns

We will remove the ID column because it is not useful for our model.
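Equivalently, and a bit more robustly, the columns could be selected by name rather than by position. A sketch:

# same result as df.iloc[:, 1:-1] / df.iloc[:, -1], but explicit about which columns are dropped
X = df.drop(columns=['CUST_ID', 'DEFAULT'])
y = df['DEFAULT']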

In [3]:
X
Out[3]:
INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING R_CLOTHING_INCOME ... R_EXPENDITURE_INCOME R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_GAMBLING CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE
0 33269 0 532304 0.0000 16.0000 1.2000 1889 945 0.5003 0.0568 ... 1.0000 0.0000 0.0625 High 1 0 0 0 0 444
1 77158 91187 315648 1.1818 4.0909 3.4615 5818 111 0.0191 0.0754 ... 0.9091 0.7692 0.2222 No 1 0 0 1 0 625
2 30917 21642 534864 0.7000 17.3000 24.7142 1157 860 0.7433 0.0374 ... 1.0000 1.4286 0.0578 High 1 0 0 1 0 469
3 80657 64526 629125 0.8000 7.8000 9.7499 6857 3686 0.5376 0.0850 ... 1.0000 1.2500 0.1282 High 1 0 0 1 0 559
4 149971 1172498 2399531 7.8182 16.0000 2.0465 1978 322 0.1628 0.0132 ... 0.9091 0.1163 0.0568 High 1 1 1 1 1 473
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 328892 1465066 5501471 4.4546 16.7273 3.7551 16701 10132 0.6067 0.0508 ... 0.9091 0.2041 0.0543 High 1 1 1 1 1 418
996 81404 88805 680837 1.0909 8.3637 7.6667 5400 1936 0.3585 0.0663 ... 0.9091 0.8333 0.1087 No 1 0 0 1 0 589
997 0 42428 30760 3.2379 8.1889 0.7250 0 0 0.8779 0.0047 ... 1.0668 0.2500 0.3448 No 1 0 0 1 0 499
998 36011 8002 604181 0.2222 16.7777 75.5037 1993 1271 0.6377 0.0553 ... 1.1111 5.0002 0.0662 No 1 1 0 1 0 507
999 44266 309859 44266 6.9999 1.0000 0.1429 1574 1264 0.8030 0.0356 ... 1.1111 0.1587 1.1111 No 1 0 0 1 0 657

1000 rows × 85 columns

In [4]:
y          
Out[4]:
0      1
1      0
2      1
3      0
4      0
      ..
995    0
996    1
997    0
998    0
999    0
Name: DEFAULT, Length: 1000, dtype: int64
In [5]:
# Split whole dataset into two parts: for modellers and for validators (70% and 30%)
X_mod, X_val, y_mod, y_val = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(X_mod.shape, X_val.shape, y_mod.shape, y_val.shape)
(700, 85) (300, 85) (700,) (300,)
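Because we pass stratify=y, the default rate should be nearly identical in both parts. A quick check, as a sketch:

print(y.mean(), y_mod.mean(), y_val.mean())  # proportions of DEFAULT == 1 should match closely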
In [6]:
df_validators=pd.concat([X_val, y_val], axis=1)
df_validators.to_csv('../../data/for_validators/credit_score_validators.csv', index=False)
In [7]:
# Split the modellers' part into two parts: for training and for validation (70% and 30%)
X_train, X_valid, y_train, y_valid = train_test_split(X_mod, y_mod, test_size=0.3, random_state=42, stratify=y_mod)
print(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)
(490, 85) (210, 85) (490,) (210,)
In [8]:
df_modellers = pd.concat([X_train, y_train], axis=1)
df_modellers.to_csv('../../data/for_modelling/credit_score_train.csv', index=False)
In [9]:
# Split validation part from the first part into two parts: for validation and for testing (66% and 34%)
X_validation,X_test,y_validation,y_test = train_test_split(X_valid, y_valid, test_size=0.34, random_state=42, stratify=y_valid)
print(X_validation.shape, X_test.shape, y_validation.shape, y_test.shape)
(138, 85) (72, 85) (138,) (72,)
In [10]:
df_valid=pd.concat([X_validation, y_validation], axis=1)
df_valid.to_csv('../../data/for_modelling/credit_score_valid.csv', index=False)
In [11]:
df_test=pd.concat([X_test, y_test], axis=1)
df_test.to_csv('../../data/for_modelling/credit_score_test.csv', index=False)

Yay! We have prepared the data for modellers and validators¶

Let's move to EDA

In [12]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

df_train = pd.read_csv('../../data/for_modelling/credit_score_train.csv')
df_val = pd.read_csv('../../data/for_modelling/credit_score_valid.csv')
df=pd.concat([df_train, df_val])
df.head()
Out[12]:
INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING R_CLOTHING_INCOME ... R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_GAMBLING CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE DEFAULT
0 2783 1855 0 0.6665 0.00 0.0000 103 74 0.7184 0.0370 ... 2.5003 0.0000 No 0 0 0 1 0 570 0
1 314430 445442 707468 1.4167 2.25 1.5882 35861 29157 0.8131 0.1141 ... 0.5882 0.3704 High 1 0 1 1 0 691 0
2 161773 517674 2782496 3.2000 17.20 5.3750 3716 2533 0.6816 0.0230 ... 0.3125 0.0581 No 1 1 1 1 1 520 0
3 16014 97685 20818 6.1000 1.30 0.2131 637 187 0.2936 0.0398 ... 0.1639 0.7692 No 1 0 0 1 0 654 0
4 193225 1410542 2589215 7.3000 13.40 1.8356 5276 2325 0.4407 0.0273 ... 0.1370 0.0746 No 1 1 0 1 1 552 0

5 rows × 86 columns

In [13]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 628 entries, 0 to 137
Data columns (total 86 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   INCOME                   628 non-null    int64  
 1   SAVINGS                  628 non-null    int64  
 2   DEBT                     628 non-null    int64  
 3   R_SAVINGS_INCOME         628 non-null    float64
 4   R_DEBT_INCOME            628 non-null    float64
 5   R_DEBT_SAVINGS           628 non-null    float64
 6   T_CLOTHING_12            628 non-null    int64  
 7   T_CLOTHING_6             628 non-null    int64  
 8   R_CLOTHING               628 non-null    float64
 9   R_CLOTHING_INCOME        628 non-null    float64
 10  R_CLOTHING_SAVINGS       628 non-null    float64
 11  R_CLOTHING_DEBT          628 non-null    float64
 12  T_EDUCATION_12           628 non-null    int64  
 13  T_EDUCATION_6            628 non-null    int64  
 14  R_EDUCATION              628 non-null    float64
 15  R_EDUCATION_INCOME       628 non-null    float64
 16  R_EDUCATION_SAVINGS      628 non-null    float64
 17  R_EDUCATION_DEBT         628 non-null    float64
 18  T_ENTERTAINMENT_12       628 non-null    int64  
 19  T_ENTERTAINMENT_6        628 non-null    int64  
 20  R_ENTERTAINMENT          628 non-null    float64
 21  R_ENTERTAINMENT_INCOME   628 non-null    float64
 22  R_ENTERTAINMENT_SAVINGS  628 non-null    float64
 23  R_ENTERTAINMENT_DEBT     628 non-null    float64
 24  T_FINES_12               628 non-null    int64  
 25  T_FINES_6                628 non-null    int64  
 26  R_FINES                  628 non-null    float64
 27  R_FINES_INCOME           628 non-null    float64
 28  R_FINES_SAVINGS          628 non-null    float64
 29  R_FINES_DEBT             628 non-null    float64
 30  T_GAMBLING_12            628 non-null    int64  
 31  T_GAMBLING_6             628 non-null    int64  
 32  R_GAMBLING               628 non-null    float64
 33  R_GAMBLING_INCOME        628 non-null    float64
 34  R_GAMBLING_SAVINGS       628 non-null    float64
 35  R_GAMBLING_DEBT          628 non-null    float64
 36  T_GROCERIES_12           628 non-null    int64  
 37  T_GROCERIES_6            628 non-null    int64  
 38  R_GROCERIES              628 non-null    float64
 39  R_GROCERIES_INCOME       628 non-null    float64
 40  R_GROCERIES_SAVINGS      628 non-null    float64
 41  R_GROCERIES_DEBT         628 non-null    float64
 42  T_HEALTH_12              628 non-null    int64  
 43  T_HEALTH_6               628 non-null    int64  
 44  R_HEALTH                 628 non-null    float64
 45  R_HEALTH_INCOME          628 non-null    float64
 46  R_HEALTH_SAVINGS         628 non-null    float64
 47  R_HEALTH_DEBT            628 non-null    float64
 48  T_HOUSING_12             628 non-null    int64  
 49  T_HOUSING_6              628 non-null    int64  
 50  R_HOUSING                628 non-null    float64
 51  R_HOUSING_INCOME         628 non-null    float64
 52  R_HOUSING_SAVINGS        628 non-null    float64
 53  R_HOUSING_DEBT           628 non-null    float64
 54  T_TAX_12                 628 non-null    int64  
 55  T_TAX_6                  628 non-null    int64  
 56  R_TAX                    628 non-null    float64
 57  R_TAX_INCOME             628 non-null    float64
 58  R_TAX_SAVINGS            628 non-null    float64
 59  R_TAX_DEBT               628 non-null    float64
 60  T_TRAVEL_12              628 non-null    int64  
 61  T_TRAVEL_6               628 non-null    int64  
 62  R_TRAVEL                 628 non-null    float64
 63  R_TRAVEL_INCOME          628 non-null    float64
 64  R_TRAVEL_SAVINGS         628 non-null    float64
 65  R_TRAVEL_DEBT            628 non-null    float64
 66  T_UTILITIES_12           628 non-null    int64  
 67  T_UTILITIES_6            628 non-null    int64  
 68  R_UTILITIES              628 non-null    float64
 69  R_UTILITIES_INCOME       628 non-null    float64
 70  R_UTILITIES_SAVINGS      628 non-null    float64
 71  R_UTILITIES_DEBT         628 non-null    float64
 72  T_EXPENDITURE_12         628 non-null    int64  
 73  T_EXPENDITURE_6          628 non-null    int64  
 74  R_EXPENDITURE            628 non-null    float64
 75  R_EXPENDITURE_INCOME     628 non-null    float64
 76  R_EXPENDITURE_SAVINGS    628 non-null    float64
 77  R_EXPENDITURE_DEBT       628 non-null    float64
 78  CAT_GAMBLING             628 non-null    object 
 79  CAT_DEBT                 628 non-null    int64  
 80  CAT_CREDIT_CARD          628 non-null    int64  
 81  CAT_MORTGAGE             628 non-null    int64  
 82  CAT_SAVINGS_ACCOUNT      628 non-null    int64  
 83  CAT_DEPENDENTS           628 non-null    int64  
 84  CREDIT_SCORE             628 non-null    int64  
 85  DEFAULT                  628 non-null    int64  
dtypes: float64(51), int64(34), object(1)
memory usage: 426.8+ KB
In [14]:
#how many categorical and numerical columns
cat_cols = df.select_dtypes(include='object').columns
num_cols = df.select_dtypes(include='number').columns
print(f'categorical columns: {cat_cols}')
print(f'numerical columns: {num_cols}')
print(f'number of categorical columns: {len(cat_cols)}')
print(f'number of numerical columns: {len(num_cols)}')
categorical columns: Index(['CAT_GAMBLING'], dtype='object')
numerical columns: Index(['INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME', 'R_DEBT_INCOME',
       'R_DEBT_SAVINGS', 'T_CLOTHING_12', 'T_CLOTHING_6', 'R_CLOTHING',
       'R_CLOTHING_INCOME', 'R_CLOTHING_SAVINGS', 'R_CLOTHING_DEBT',
       'T_EDUCATION_12', 'T_EDUCATION_6', 'R_EDUCATION', 'R_EDUCATION_INCOME',
       'R_EDUCATION_SAVINGS', 'R_EDUCATION_DEBT', 'T_ENTERTAINMENT_12',
       'T_ENTERTAINMENT_6', 'R_ENTERTAINMENT', 'R_ENTERTAINMENT_INCOME',
       'R_ENTERTAINMENT_SAVINGS', 'R_ENTERTAINMENT_DEBT', 'T_FINES_12',
       'T_FINES_6', 'R_FINES', 'R_FINES_INCOME', 'R_FINES_SAVINGS',
       'R_FINES_DEBT', 'T_GAMBLING_12', 'T_GAMBLING_6', 'R_GAMBLING',
       'R_GAMBLING_INCOME', 'R_GAMBLING_SAVINGS', 'R_GAMBLING_DEBT',
       'T_GROCERIES_12', 'T_GROCERIES_6', 'R_GROCERIES', 'R_GROCERIES_INCOME',
       'R_GROCERIES_SAVINGS', 'R_GROCERIES_DEBT', 'T_HEALTH_12', 'T_HEALTH_6',
       'R_HEALTH', 'R_HEALTH_INCOME', 'R_HEALTH_SAVINGS', 'R_HEALTH_DEBT',
       'T_HOUSING_12', 'T_HOUSING_6', 'R_HOUSING', 'R_HOUSING_INCOME',
       'R_HOUSING_SAVINGS', 'R_HOUSING_DEBT', 'T_TAX_12', 'T_TAX_6', 'R_TAX',
       'R_TAX_INCOME', 'R_TAX_SAVINGS', 'R_TAX_DEBT', 'T_TRAVEL_12',
       'T_TRAVEL_6', 'R_TRAVEL', 'R_TRAVEL_INCOME', 'R_TRAVEL_SAVINGS',
       'R_TRAVEL_DEBT', 'T_UTILITIES_12', 'T_UTILITIES_6', 'R_UTILITIES',
       'R_UTILITIES_INCOME', 'R_UTILITIES_SAVINGS', 'R_UTILITIES_DEBT',
       'T_EXPENDITURE_12', 'T_EXPENDITURE_6', 'R_EXPENDITURE',
       'R_EXPENDITURE_INCOME', 'R_EXPENDITURE_SAVINGS', 'R_EXPENDITURE_DEBT',
       'CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT',
       'CAT_DEPENDENTS', 'CREDIT_SCORE', 'DEFAULT'],
      dtype='object')
number of categorical columns: 1
number of numerical columns: 85
In [15]:
#check CAT_GAMBLING values
df['CAT_GAMBLING'].value_counts()
Out[15]:
CAT_GAMBLING
No      392
High    163
Low      73
Name: count, dtype: int64

Small Summary¶

  • No missing values :)
  • 1 categorical column
  • 85 numerical columns
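The "no missing values" claim can be verified in one line. A sketch:

print(df.isna().sum().sum())  # 0 -> no missing values anywhere in the frame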
In [16]:
df.describe()
Out[16]:
INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING R_CLOTHING_INCOME ... R_EXPENDITURE_INCOME R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE DEFAULT
count 628.000000 6.280000e+02 6.280000e+02 628.000000 628.000000 628.000000 628.000000 628.000000 628.000000 628.000000 ... 628.000000 628.000000 628.000000 628.000000 628.000000 628.00000 628.000000 628.000000 628.000000 628.000000
mean 122852.219745 4.234954e+05 8.315110e+05 4.251653 6.050788 5.730516 6793.633758 3445.570064 0.461276 0.054612 ... 0.939313 0.897402 0.590555 0.944268 0.243631 0.18949 0.995223 0.171975 587.060510 0.283439
std 111735.273010 4.459637e+05 1.013800e+06 4.045379 5.793431 15.021525 7456.934234 5146.191291 0.236514 0.035962 ... 0.165736 1.597915 1.269410 0.229587 0.429614 0.39221 0.069006 0.377659 62.146964 0.451027
min 0.000000 0.000000e+00 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.003300 0.003700 ... 0.666700 0.067600 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 300.000000 0.000000
25% 30494.500000 7.529625e+04 4.978375e+04 1.100000 1.500000 0.203100 1127.000000 295.000000 0.269450 0.029675 ... 0.833300 0.148750 0.099000 1.000000 0.000000 0.00000 1.000000 0.000000 555.000000 0.000000
50% 86994.000000 2.946345e+05 4.056500e+05 2.713650 4.883350 2.000000 4380.000000 1319.500000 0.475500 0.045550 ... 0.909100 0.303000 0.178600 1.000000 0.000000 0.00000 1.000000 0.000000 593.500000 0.000000
75% 183004.500000 6.272762e+05 1.329519e+06 6.999925 8.725000 4.540175 10080.750000 4638.000000 0.630000 0.067200 ... 1.000000 0.833300 0.588225 1.000000 0.000000 0.00000 1.000000 0.000000 630.000000 1.000000
max 545667.000000 2.738164e+06 5.968620e+06 16.111200 37.000600 166.978800 43255.000000 39918.000000 1.058300 0.241300 ... 2.000000 10.009900 10.005300 1.000000 1.000000 1.00000 1.000000 1.000000 800.000000 1.000000

8 rows × 85 columns

In [17]:
target='DEFAULT'
df[target].value_counts()
Out[17]:
DEFAULT
0    450
1    178
Name: count, dtype: int64

We are looking for the best model to predict the target variable (DEFAULT)¶

In [23]:
# encoding categorical variable CAT_GAMBLING to numerical
df['CAT_GAMBLING'] = df['CAT_GAMBLING'].map({'No': 0, 'Low': 1,'High': 2})
df
Out[23]:
INCOME SAVINGS DEBT R_SAVINGS_INCOME R_DEBT_INCOME R_DEBT_SAVINGS T_CLOTHING_12 T_CLOTHING_6 R_CLOTHING R_CLOTHING_INCOME ... R_EXPENDITURE_SAVINGS R_EXPENDITURE_DEBT CAT_GAMBLING CAT_DEBT CAT_CREDIT_CARD CAT_MORTGAGE CAT_SAVINGS_ACCOUNT CAT_DEPENDENTS CREDIT_SCORE DEFAULT
0 2783 1855 0 0.6665 0.0000 0.0000 103 74 0.7184 0.0370 ... 2.5003 0.0000 0 0 0 0 1 0 570 0
1 314430 445442 707468 1.4167 2.2500 1.5882 35861 29157 0.8131 0.1141 ... 0.5882 0.3704 2 1 0 1 1 0 691 0
2 161773 517674 2782496 3.2000 17.2000 5.3750 3716 2533 0.6816 0.0230 ... 0.3125 0.0581 0 1 1 1 1 1 520 0
3 16014 97685 20818 6.1000 1.3000 0.2131 637 187 0.2936 0.0398 ... 0.1639 0.7692 0 1 0 0 1 0 654 0
4 193225 1410542 2589215 7.3000 13.4000 1.8356 5276 2325 0.4407 0.0273 ... 0.1370 0.0746 0 1 1 0 1 1 552 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
133 386976 1250231 654883 3.2308 1.6923 0.5238 36665 30569 0.8337 0.0947 ... 0.2381 0.4545 0 1 0 0 1 0 562 0
134 16772 8386 0 0.5000 0.0000 0.0000 238 186 0.7815 0.0142 ... 2.5000 0.9786 0 0 0 0 1 0 619 0
135 137509 206264 206264 1.5000 1.5000 1.0000 15661 10865 0.6938 0.1139 ... 0.6667 0.6667 0 1 0 0 1 0 639 1
136 259146 1922000 1662854 7.4167 6.4167 0.8652 12180 8527 0.7001 0.0470 ... 0.1124 0.1299 2 1 1 0 1 1 587 1
137 56657 442952 51506 7.8181 0.9091 0.1163 1085 289 0.2664 0.0192 ... 0.1163 1.0000 0 1 0 0 1 0 635 1

628 rows × 86 columns

We have encoded CAT_GAMBLING

  • No: 0
  • Low: 1
  • High: 2 (the validators didn't like this ordinal encoding; see the alternative below)
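One-hot encoding is a common alternative that avoids imposing an order and equal spacing on the categories. A sketch; it would be applied to the original 'No'/'Low'/'High' string column, instead of the .map above:

# one-hot encode CAT_GAMBLING instead of mapping it to 0/1/2
df_onehot = pd.get_dummies(df, columns=['CAT_GAMBLING'], prefix='GAMBLING')
# -> adds GAMBLING_No, GAMBLING_Low, GAMBLING_High indicator columns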
In [19]:
df_without_cat = df.drop(columns=['CAT_GAMBLING'])

On the advice of the validators, we exclude the encoded CAT_GAMBLING variable from the correlation matrix.

In [20]:
# correlation matrix
corr = df_without_cat.corr()
plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=False, cmap='coolwarm', fmt=".2f")
plt.show()
In [21]:
high_correlation_matrix= corr[abs(corr)>0.7]
#plot
plt.figure(figsize=(20, 20))
sns.heatmap(high_correlation_matrix, annot=False, cmap='coolwarm', fmt=".2f")
plt.show()

Big candidates for removal - high correlation with other variables¶

  • T_TAX_6
  • T_TAX_12
  • T_UTILITIES_6
  • T_UTILITIES_12
  • T_EXPENDITURE_6
  • T_EXPENDITURE_12
  • T_GROCERIES_6
  • T_GROCERIES_12

Small candidates for removal - high correlation with other variables¶

  • T_HEALTH_6
  • T_HEALTH_12
  • T_TRAVEL_6
  • T_TRAVEL_12
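The candidate lists above were read off the heatmap; the same pairs can be extracted programmatically. A sketch:

# list feature pairs with |correlation| > 0.7, upper triangle only to avoid duplicates
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high_pairs = corr.where(mask).stack()
print(high_pairs[high_pairs.abs() > 0.7].sort_values(ascending=False))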

Summary¶

We have 3 similar columns

  • T_{GROUP}_6
  • T_{GROUP}_12
  • R_{GROUP}

We might remove two of them, but we are certain that we will remove at least one.

R_DEBT_INCOME and CREDIT_SCORE are highly correlated.
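The three columns are related by definition: R_{GROUP} should be T_{GROUP}_6 / T_{GROUP}_12. A quick check for one group, as a sketch, guarding against zero denominators (the stored ratios appear rounded, so the difference should only be close to 0):

ratio = df['T_CLOTHING_6'] / df['T_CLOTHING_12'].replace(0, np.nan)
print((ratio - df['R_CLOTHING']).abs().max())  # ~0 up to rounding, wherever T_CLOTHING_12 > 0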

Let's look closer at correlation between CREDIT_SCORE, DEFAULT and other variables¶

In [24]:
correlation_matrix = df.corr()
plt.figure(figsize=(12, 20))
sns.heatmap(correlation_matrix[['DEFAULT','CREDIT_SCORE']].sort_values(by='DEFAULT', ascending=False), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation with CREDIT_SCORE, DEFAULT and other variables')
plt.show()

Summary¶

  • There is a clear negative correlation between CREDIT_SCORE and DEFAULT. This indicates that lower credit scores are associated with a higher risk of default. This result is expected, as the credit score is an indicator of an individual's creditworthiness

  • The variable R_DEBT_INCOME shows the highest positive correlation with DEFAULT, suggesting that a higher debt-to-income ratio may increase the risk of default. This is understandable, as a higher debt-to-income ratio indicates a greater financial burden relative to a person's earning capacity. It also has the strongest negative correlation with CREDIT_SCORE, which suggests it will be a very important variable for predicting both default and credit score

  • Variables like CAT_MORTGAGE, CAT_SAVINGS_ACCOUNT and those related to expenses (such as T_GROCERIES_12, T_GAMBLING_12, T_UTILITIES_6) show low correlation both with CREDIT_SCORE and DEFAULT, which might suggest that they do not significantly affect credit risk and default

In [10]:
# show DEBT and CAT_DEBT column
df[['DEBT', 'CAT_DEBT']]
Out[10]:
DEBT CAT_DEBT
0 0 0
1 707468 1
2 2782496 1
3 20818 1
4 2589215 1
... ... ...
133 654883 1
134 0 0
135 206264 1
136 1662854 1
137 51506 1

628 rows × 2 columns

DEBT and CAT_DEBT¶

  • We will remove the CAT_DEBT column, because the same information is already contained in the DEBT column (see the check below)
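The redundancy is easy to verify with a one-line sketch:

print((df['CAT_DEBT'] == (df['DEBT'] > 0).astype(int)).all())  # True if CAT_DEBT carries no extra information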
In [25]:
# look at the distributions of all columns except those with only 1 or 2 distinct values (constants and binary flags)
columns_to_plot = [col for col in df.columns if df[col].nunique() not in (1, 2)]

df[columns_to_plot].hist(bins=40, figsize=(20, 20))
plt.tight_layout()
plt.show()

Summary¶

FEATURES THAT LOOK ROUGHLY NORMALLY DISTRIBUTED

  • R_CLOTHING (somewhat)
  • R_EDUCATION
  • R_FINES
  • R_GAMBLING
  • R_HEALTH
  • R_TRAVEL (somewhat)
  • R_UTILITIES
  • R_UTILITIES_INCOME
  • R_EXPENDITURE
  • CREDIT_SCORE

CANDIDATES: FEATURES THAT COULD PROBABLY BE TRANSFORMED TO A NORMAL DISTRIBUTION (see the log-transform sketch after this list)

  • INCOME
  • SAVINGS
  • DEBT
  • R_SAVINGS_INCOME
  • R_DEBT_INCOME
  • T_CLOTHING_6
  • T_CLOTHING_12
  • R_CLOTHING_INCOME
  • T_ENTERTAINMENT_6
  • T_ENTERTAINMENT_12
  • R_ENTERTAINMENT_INCOME
  • T_GROCERIES_6
  • T_GROCERIES_12
  • R_GROCERIES_INCOME
  • T_HEALTH_6
  • T_HEALTH_12
  • R_HEALTH_INCOME
  • R_TAX_INCOME
  • T_TRAVEL_6
  • T_TRAVEL_12
  • R_TRAVEL_INCOME
  • T_EXPENDITURE_6
  • T_EXPENDITURE_12

INTERESTING FEATURES: NOT SURE YET WHAT TO DO WITH THEM

  • T_UTILITIES_6
  • T_UTILITIES_12
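For the right-skewed candidates above, a log transform is the usual first attempt. A sketch with INCOME, using log1p since some incomes are 0:

np.log1p(df['INCOME']).hist(bins=40)
plt.title('log1p(INCOME)')
plt.show()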
In [26]:
# scatterplots of every feature against CREDIT_SCORE, colored by DEFAULT
cols = df.drop(columns=['CREDIT_SCORE']).columns
n_rows = int(np.ceil(len(cols) / 3))  # 85 features -> a 29 x 3 grid (a fixed 28 x 3 grid is one axis short)
fig, axs = plt.subplots(n_rows, 3, figsize=(16, 100))
axs = axs.flatten()
for i, col in enumerate(cols):
    sns.scatterplot(data=df, x=col, y='CREDIT_SCORE', ax=axs[i], s=15, hue='DEFAULT')
for ax in axs[len(cols):]:  # hide unused axes in the last row
    ax.set_visible(False)
plt.suptitle("Target variable vs. features")
plt.tight_layout()

Summary¶

(TO BE FILLED IN)

In [34]:
sns.boxplot(x='CAT_GAMBLING', y='CREDIT_SCORE', data=df)
plt.title('Boxplot of CREDIT_SCORE by CAT_GAMBLING')
plt.xlabel('CAT_GAMBLING')
plt.ylabel('CREDIT_SCORE')
plt.show()

On the advice of the validators, we make a boxplot for the categorical data. We can see that the credit score tends to be lower when CAT_GAMBLING is high.

In [38]:
# select columns whose names start with "CAT"
cat_columns = df.filter(regex=r'^CAT')

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# bar chart of the average CREDIT_SCORE for each CAT_* column
for i, column in enumerate(cat_columns):
    row_index = i // 3  # row index
    col_index = i % 3   # column index
    ax = axes[row_index, col_index]
    df.groupby(column)['CREDIT_SCORE'].mean().plot(kind='bar', ax=ax, title=f'Average CREDIT_SCORE by {column}')
    ax.set_xlabel(column)
    ax.set_ylabel('Average CREDIT_SCORE')
    ax.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

"Binary variables could be examined with bar charts" ~ the validators :) So we did it. Summary:

  • CAT_SAVINGS_ACCOUNT and CAT_DEBT have the most impact on CREDIT_SCORE (among the categorical variables)
  • CAT_MORTGAGE doesn't noticeably impact CREDIT_SCORE
In [37]:
average_default = df.groupby('CAT_GAMBLING')['DEFAULT'].mean()
print(average_default)
CAT_GAMBLING
0    0.260204
1    0.273973
2    0.343558
Name: DEFAULT, dtype: float64

Summary¶

The higher CAT_GAMBLING, the higher the average DEFAULT rate (i.e., a higher chance of not repaying the debt).

In [13]:
# show only the R_FINES_SAVINGS column
df[['R_FINES_SAVINGS']]
Out[13]:
R_FINES_SAVINGS
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
... ...
133 0.0
134 0.0
135 0.0
136 0.0
137 0.0

628 rows × 1 columns
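All displayed rows are zero, so it is worth checking how much of the column is actually zero. A sketch:

print((df['R_FINES_SAVINGS'] == 0).mean())  # share of exact zeros
print(df['R_FINES_SAVINGS'].nunique())      # number of distinct values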


Let's take a closer look at income.

In [43]:
bins = [0, 10000, 30000, 60000, 100000, 150000, 662094]
# assign each customer to an income bucket
df['income_cat'] = pd.cut(df['INCOME'], bins=bins)

average_default = df.groupby('income_cat')['DEFAULT'].mean()
print(average_default)
income_cat
(0, 10000]          0.153846
(10000, 30000]      0.336735
(30000, 60000]      0.222222
(60000, 100000]     0.259740
(100000, 150000]    0.282353
(150000, 662094]    0.305164
Name: DEFAULT, dtype: float64

It seems that income doesn't have much influence on repaying the loan, but we haven't considered the possibility that the more someone earns, the larger the loan they may take. It may be that smaller loans are more often repaid by those who earn more, while those who earn less don't take out large loans, which might be repaid "worse."

In [46]:
average_default = df.groupby('DEFAULT')['INCOME'].mean()
print(average_default)
DEFAULT
0    119266.653333
1    131916.853933
Name: INCOME, dtype: float64

Exactly :)

In [47]:
average_default = df.groupby('DEFAULT')['SAVINGS'].mean()
print(average_default)
DEFAULT
0    418978.322222
1    434914.865169
Name: SAVINGS, dtype: float64

It's the same with savings.

In [48]:
# default rate per income bucket within four DEBT ranges
# (hand-picked cut points that roughly split customers by debt size)
low_debt = df[df['DEBT'] <= 5.396675e+04]
print(low_debt.groupby('income_cat')['DEFAULT'].mean())

medium_debt = df[(df['DEBT'] > 5.396675e+04) & (df['DEBT'] <= 3.950955e+05)]
print(medium_debt.groupby('income_cat')['DEFAULT'].mean())

high_debt = df[(df['DEBT'] > 3.950955e+05) & (df['DEBT'] <= 1.193230e+06)]
print(high_debt.groupby('income_cat')['DEFAULT'].mean())

highest_debt = df[df['DEBT'] > 1.193230e+06]
print(highest_debt.groupby('income_cat')['DEFAULT'].mean())
income_cat
(0, 10000]          0.181818
(10000, 30000]      0.200000
(30000, 60000]      0.159091
(60000, 100000]     0.000000
(100000, 150000]         NaN
(150000, 662094]         NaN
Name: DEFAULT, dtype: float64
income_cat
(0, 10000]          0.000000
(10000, 30000]      0.400000
(30000, 60000]      0.266667
(60000, 100000]     0.219512
(100000, 150000]    0.157895
(150000, 662094]    0.111111
Name: DEFAULT, dtype: float64
income_cat
(0, 10000]               NaN
(10000, 30000]      0.923077
(30000, 60000]      0.300000
(60000, 100000]     0.333333
(100000, 150000]    0.238095
(150000, 662094]    0.193548
Name: DEFAULT, dtype: float64
income_cat
(0, 10000]               NaN
(10000, 30000]           NaN
(30000, 60000]           NaN
(60000, 100000]     1.000000
(100000, 150000]    0.458333
(150000, 662094]    0.366197
Name: DEFAULT, dtype: float64

Summary¶

As we suspected, individuals with low income don't take out very large loans. Once we split by debt range, it becomes noticeable that income does have a significant impact on loan repayment: within each debt range, the lower the income, the higher the share of unpaid loans.
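The same analysis fits in a single pivot table. A sketch; pd.qcut splits DEBT into four equal-count bins, which is roughly what the manual thresholds above do:

debt_cat = pd.qcut(df['DEBT'], 4, duplicates='drop')
print(df.pivot_table(index='income_cat', columns=debt_cat, values='DEFAULT', aggfunc='mean'))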

Bonus¶

We will build a model to check the importance of the features

In [27]:
df_train = pd.read_csv('../../data/for_modelling/credit_score_train.csv')
df_val = pd.read_csv('../../data/for_modelling/credit_score_valid.csv')
df = pd.concat([df_train, df_val])
# map CAT_GAMBLING to numerical
df['CAT_GAMBLING'] = df['CAT_GAMBLING'].map({'No': 0, 'Low': 1, 'High': 2})
X = df.drop(columns=['DEFAULT'])
y = df['DEFAULT']

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
importances = model.feature_importances_
features = X.columns
forest_importances = pd.Series(importances, index=features).sort_values(ascending=False)
forest_importances
Out[27]:
CREDIT_SCORE            0.042118
R_DEBT_INCOME           0.026159
R_TAX_DEBT              0.024946
R_UTILITIES_DEBT        0.024116
R_ENTERTAINMENT_DEBT    0.021476
                          ...   
CAT_GAMBLING            0.000949
CAT_MORTGAGE            0.000904
CAT_DEBT                0.000376
CAT_DEPENDENTS          0.000259
CAT_SAVINGS_ACCOUNT     0.000000
Length: 85, dtype: float64
In [28]:
plt.figure(figsize=(20, 20))
forest_importances.plot(kind='barh')
Out[28]:
<Axes: >

Summary¶

  • Lots of features starting with R_ are important; we SHOULD NOT remove them
  • CREDIT_SCORE is the most important feature
  • R_DEBT_INCOME is the second most important feature
  • R_TAX_DEBT is the third most important feature
  • lots of unexpected features are important

According to this model we could remove (but see the cross-check below):

  • CAT_SAVINGS_ACCOUNT
  • CAT_DEPENDENTS
  • CAT_MORTGAGE
  • CAT_GAMBLING
  • CAT_DEBT etc.
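Impurity-based importances from a random forest can be biased toward high-cardinality numeric features, so before actually dropping columns it may be worth cross-checking with permutation importance. A sketch:

from sklearn.inspection import permutation_importance

# shuffle each feature and measure the drop in model score (accuracy by default for a classifier)
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
perm_importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm_importances.head(10))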